OBJECTIVE

Microbes  and  viruses  evolve.  Their  evolution  is  often  more  rapid  and  of  greater  practical 
importance  than  our  own  evolution.  How  can  we  understand,  or  even  predict,  the  evolutionary 
trajectory  of  microbes  as  they  adapt?  For  example,  what  determines  how  quickly,  and  by  what 
specific  mutations,  avian  influenza  viruses  will  adapt  to  novel  human  hosts;  or  how  readily 
infectious  bacteria  will  escape  antibiotics  or  the  human  immune  system? 

In  this  research  program  we  seek  to  combine  mathematical  models  and  statistical 
techniques  to  tackle  this  problem  head-on:  to  infer  from  data  the  determinants  of  microbial 
evolution  with  sufficient  resolution  that  we  can  quantify  their  evolutionary  trajectories,  and 
sometimes  even  predict  the  details  of  their  evolution. 

SCIENTIFIC  BARRIERS 

The  rules  of  evolution  are  simple:  mutations  introduce  variants  into  a  population,  whose 
frequencies  then  change  by  genetic  drift  and  natural  selection.  But  the  resulting  evolutionary 
dynamics  are  extraordinarily  complicated  -  because  they  depend  on  the  so-called  "fitness 
landscape"  that  describes  the  fitness  (reproductive  rate)  associated  with  each  possible  genetic 
type  of  the  organism.  Despite  its  central  importance  in  evolution,  very  little  is  known  about  the 
actual  fitness  landscape  of  any  biological  organism. 

An  expanding  body  of  experimental  data  on  microbial  populations  has  begun  to  provide 
the  empirical  basis  required  to  draw  inferences  about  organismal  fitness  landscapes  and  how 
they  shape  evolution.  Nevertheless,  even  for  simple  organisms,  the  number  of  possible 
genotypes  is  astronomically  large,  and  therefore  the  fitness  landscape  is  very  high  dimensional. 
High-throughput  experiments  on  laboratory  populations  of  microbes  produce  massive  amounts 
of  data,  and  yet  still  not  nearly  enough  data  to  determine  an  entire  fitness  landscape  directly. 

This  presents  the  field  with  several  pressing  questions:  how  do  we  infer  fitness 
landscapes  from  limited  samples  of  genotypes?  Do  statistical  approximations  based  on 
available  data  faithfully  reproduce  the  true  fitness  landscape  and  accurately  predict  the 
dynamics  of  adaptation?  Can  we  leverage  time-series  data  to  learn  more  about  the  fitness 
landscape  and  its  effect  on  microbial  evolution?  These  questions  are  intrinsically  mathematical 
and  statistical  in  nature.  Answering  these  questions  demands  familiarity  with  the  empirical 
literature  on  evolving  microbes  and  how  they  are  interrogated  experimentally;  as  well  as 
familiarity  with  the  mathematical  and  statistical  techniques  required  to  draw  meaningful 
inferences  from  these  data. 

SIGNIFICANCE 

Quantitative  models  of  evolution  have  historically  assumed  simple  models  of  the  fitness 
landscape,  with  no  serious  attempt  to  determine  its  actual  structure  in  nature.  But  the  moment 
has  arrived  when  empirical  data,  analytic  sophistication,  and  computational  tools  make  it 
feasible  to  determine  the  actual  fitness  landscapes  of  some  organisms. 


The  payoffs  of  such  a  research  program  are  potentially  manifold  --  both  for  the 
intellectual  development  of  evolutionary  theory  and  for  practical  applications  to  controlling  viral 
and  microbial  disease.  The  practical  payoffs  hold  particular  interest  for  the  Army,  which 
regularly  exposes  its  war-fighters  to  the  insults  and  risks  of  novel  pathogens. 

APPROACH 

Our  approach  to  inferring  microbial  fitness  landscapes  combines  mathematical  models, 
statistical  techniques,  and  detailed  empirical  data  drawn  from  laboratory  and  wild  populations  of 
bacteria  and  viruses. 

The  quantitative  methods  we  use  are  rooted  in  probability  theory,  stochastic  processes, 
and  PDEs.  Such  techniques  are  required  because  differences  in  fitness  are  understood  as  the 
deterministic,  driving  force  in  an  evolving  population,  which  is  balanced  against  the  stochastic 
forces  of  genetic  drift  and  mutation.  Inferring  the  fitness  landscape  thus  requires  that  we 
discriminate  between  stochastic  effects  of  drift,  and  the  deterministic  effects  of  selection. 
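
The distinction between drift and selection can be illustrated with a minimal sketch, assuming a haploid Wright-Fisher population with two alleles. The function name, the population size, and the selection coefficient below are illustrative choices and are not taken from our published analyses.

```python
import random


def wright_fisher(n_pop, s, x0, generations, rng):
    """Simulate the frequency of one allele in a haploid Wright-Fisher population.

    n_pop: population size (assumed constant); s: selection coefficient of the
    focal allele; x0: starting frequency. Returns the frequency trajectory.
    """
    x = x0
    trajectory = [x]
    for _ in range(generations):
        # Deterministic effect of selection on the expected frequency.
        p = x * (1.0 + s) / (x * (1.0 + s) + (1.0 - x))
        # Stochastic effect of drift: binomial resampling of n_pop individuals.
        count = sum(1 for _ in range(n_pop) if rng.random() < p)
        x = count / n_pop
        trajectory.append(x)
    return trajectory


rng = random.Random(1)
replicates = 100
# Final allele frequencies across replicate populations, without and with selection.
neutral = [wright_fisher(200, 0.0, 0.5, 50, rng)[-1] for _ in range(replicates)]
selected = [wright_fisher(200, 0.05, 0.5, 50, rng)[-1] for _ in range(replicates)]
mean_neutral = sum(neutral) / replicates
mean_selected = sum(selected) / replicates
print("mean final frequency, neutral:", mean_neutral)
print("mean final frequency, selected:", mean_selected)
```

Individual neutral replicates wander widely around the starting frequency, whereas the selected replicates move upward on average. The inference problem is to tell these two cases apart from a small number of noisy trajectories.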

More  specifically,  we  are  working  to  characterize  the  dynamics  of  adaptation  in  forward 
time  on  large  families  of  mathematical  fitness  landscapes;  and  then,  conversely,  leverage 
empirical data to infer the fitness landscape on which an organism is adapting. We are
exploiting a variety of techniques, new and old, to describe fitness landscapes - including
generalizations  of  the  famous  NK  landscapes  of  Kauffman  and  Levin,  as  well  as  our  recent 
technique  describing  a  landscape  as  a  family  of  distributions  of  mutational  effects.  In  addition  we 
are  developing  mathematical  and  computational  tools,  using  infinite-population  diffusion  limits  of 
standard  Markov  models,  to  simulate  different  fitness  and  mutational  scenarios. 

This approach is entirely novel. Using a precise, mathematical
understanding  of  forward-time  dynamics  to  provide  a  rigorous  method  for  inferring  the 
determinants  of  evolution  from  data  has  not  yet  been  seriously  attempted  -  and  it  has  the 
potential  to  provide  substantial  and  practical  payoffs. 

ACCOMPLISHMENTS

We  made  tremendous  progress  towards  the  goals  of  our  proposed  ARO  research 
program. Over the course of the grant we published 26 papers, on topics including the
role  of  epistasis  in  protein  evolution,  the  structure  of  epistasis  along  adaptive  walks,  the 
inference  of  fitness  landscapes  from  time-series  data  or  experimental  evolution,  the  role  of 
deleterious  mutations  during  adaptation,  as  well  as  the  importance  of  frequency-dependent 
effects,  such  as  cooperation,  in  evolving  populations.  Below  I  will  highlight  just  a  few  of  these 
projects,  and  also  provide  a  list  of  all  ARO-funded  publications. 

Inferring  epistasis  from  microbial  evolution  experiments 

Recent  years  have  seen  a  proliferation  of  controlled,  laboratory  experiments  on  evolving 
microbial  populations.  Although  these  experiments  have  produced  examples  of  remarkable 
phenomena  -  e.g.  the  emergence  of  mutator  strains,  of  long-term  frequency-dependent 
selection,  of  novel  metabolic  capabilities,  and  even  multi-cellularity  -  a  synthetic  understanding 
of  how  to  draw  inferences  about  the  forces  that  shaped  the  course  of  evolution  in  these 
populations  is  still  lacking. 

Over the past year, I have begun an initial foray into a long-term research program on how
to  draw  principled  inferences  from  laboratory  evolution  experiments.  I  have  focused  on  how  to 
infer  the  presence  of  epistasis  -  that  is,  interactions  between  genetic  mutations  that  collectively 
influence  phenotype  and  fitness.  The  role  that  epistasis  plays  during  adaptation  remains  an 
outstanding problem, which has received considerable attention in recent years. Most of the
recent empirical studies are based on ensembles of replicate populations that adapt in a fixed,
laboratory  controlled  condition.  Researchers  often  seek  to  infer  the  presence  and  form  of 
epistasis  in  the  fitness  landscape  from  the  time  evolution  of  various  statistics  averaged  across 
the  ensemble  of  populations.  However,  researchers  lack  a  firm  statistical  framework  for  drawing 
such  inferences. 

Therefore,  I  have  begun  to  develop  a  rigorous  analysis  of  what  quantities,  drawn  from 
time  series  of  such  ensembles  of  experimental  populations,  can  be  used  to  infer  epistasis  in  the 
fitness landscape. Together with two post-docs in my group, I have analyzed the mean fitness
trajectory— that  is,  the  time  course  of  the  ensemble  average  fitness.  We  have  shown  that  for 
any  epistatic  fitness  landscape  and  starting  genotype,  there  always  exists  a  non-epistatic  fitness 
landscape  that  produces  the  exact  same  mean  fitness  trajectory.  Thus,  the  presence  of 
epistasis  is  not  identifiable  from  the  mean  fitness  trajectory.  By  contrast,  we  have  shown  that 
two  other  ensemble  statistics— the  time  evolution  of  the  fitness  variance  across  populations,  and 
the  time  evolution  of  the  mean  number  of  substitutions— can  detect  certain  forms  of  epistasis  in 
the  underlying  fitness  landscape.  This  work  provides  foundational  guidance  to  experimentalists 
who  wish  to  draw  inferences  about  how  genetic  interactions  shape  evolution.  A  paper 
describing these results was published in Evolution. The topic remains a central focus of
ongoing  research  in  my  group. 

Epistasis  along  an  adaptive  walk 

In  another  project,  we  have  systematically  studied  how  selection  can  bias  the  amount  of 
epistasis  observed  among  mutations  that  substitute  while  a  population  is  adapting.  Epistasis 
refers  to  non-additive  interactions  among  loci  that  collectively  determine  the  fitness  of  an 
organism.  Such  epistatic  interactions  are  recognized  as  fundamental  to  shaping  the  process  of 
adaptation  in  evolving  populations.  Although  little  is  known  about  the  structure  of  epistasis  in 
most  organisms,  recent  experiments  with  bacterial  populations  have  concluded  that  antagonistic 
interactions abound and tend to decelerate the pace of adaptation over time.

We  used  the  NK  mathematical  model  of  fitness  landscapes  to  examine  how  natural 
selection  biases  the  mutations  that  substitute  during  evolution,  based  on  their  epistatic 
interactions.  We  found  that,  even  when  beneficial  mutations  are  rare,  natural  selection  strongly 
biases the types of mutations that will fix; more importantly, the form of these biases changes
substantially  throughout  the  course  of  adaptation.  In  particular,  epistasis  is  less  prevalent  than 
the  neutral  expectation  early  in  adaptation  and  much  more  prevalent  later,  with  a  concomitant 
shift  from  predominantly  antagonistic  interactions  early  in  adaptation  to  synergistic  and  sign 
epistasis  later  in  adaptation. 
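
For readers unfamiliar with the NK model, the following is a minimal sketch of an NK landscape and of an adaptive walk on it. It assumes binary loci, cyclic neighborhoods (each locus interacts with the next K loci), uniformly distributed fitness contributions, and a walk that moves to a randomly chosen fitter one-mutant neighbor. These choices and all function names are illustrative assumptions, not a description of the simulations in our paper.

```python
import itertools
import random


def make_nk_landscape(n, k, rng):
    """Return a fitness function on binary genotypes of length n.

    The contribution of locus i depends on its own state and on the states of
    the next k loci (cyclic neighbors, an assumption of this sketch).
    """
    neighbors = [[(i + j) % n for j in range(k + 1)] for i in range(n)]
    # One random lookup table per locus, indexed by the states of its neighborhood.
    tables = [
        {bits: rng.random() for bits in itertools.product((0, 1), repeat=k + 1)}
        for _ in range(n)
    ]

    def fitness(genotype):
        total = 0.0
        for i in range(n):
            total += tables[i][tuple(genotype[j] for j in neighbors[i])]
        return total / n

    return fitness


def adaptive_walk(fitness, start, rng):
    """Substitute randomly chosen beneficial point mutations until none remain."""
    genotype = list(start)
    path = [fitness(genotype)]
    while True:
        fitter = []
        for i in range(len(genotype)):
            mutant = genotype[:]
            mutant[i] ^= 1  # flip locus i
            if fitness(mutant) > path[-1]:
                fitter.append(mutant)
        if not fitter:
            return genotype, path  # local fitness peak reached
        genotype = rng.choice(fitter)
        path.append(fitness(genotype))


rng = random.Random(7)
landscape = make_nk_landscape(10, 2, rng)
start = [rng.randint(0, 1) for _ in range(10)]
peak, path = adaptive_walk(landscape, start, rng)
print("substitutions along the walk:", len(path) - 1)
```

Setting K to zero makes every locus independent, so the landscape has no epistasis; larger K produces increasingly rugged landscapes with more local peaks.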

We  confirmed  our  conclusions  by  analyzing  data  from  a  recent  microbial  evolution 
experiment.  Our  results  show  that  when  the  order  of  substitutions  is  not  known,  standard 
methods  of  analysis  may  suggest  that  epistasis  retards  adaptation  when  in  fact  it  accelerates  it. 
These  results  have  immediate  implications  for  how  researchers  should  interpret  the  observed 
fitness  contributions  of  mutations  that  substitute  in  a  population  under  selection.  We  published  a 
paper  describing  these  results  in  Evolution. 

Inferring  epistasis  from  genetic  time-series 

Over  the  past  year,  we  have  developed  an  entirely  new  approach  to  inferring  selection 
on  mutations,  which  will  be  especially  useful  for  contemporary  data  on  viruses  and  bacteria. 
Population  geneticists  typically  seek  to  understand  the  selective  forces  responsible  for  patterns 
observed  in  contemporaneous  samples  of  genetic  data.  Recently,  however,  there  has  been  a 
rapid  increase  in  the  availability  of  dynamic  data,  where  the  frequencies  of  segregating  alleles  in 
an evolving population are monitored through time, both in laboratory experiments and in
natural populations. One important question is whether the changes in allele frequencies
observed  in  such  data  are  the  result  of  natural  selection  or  are  simply  consequences  of  genetic 
drift  or  sampling  noise.  In  principle,  it  seems  that  dynamic  data  should  provide  researchers  with 
more  power  to  detect  and  quantify  selective  forces  while  avoiding  the  assumptions  of 
stationarity  that  are  required  for  many  inference  techniques  based  on  static  samples. 

A  standard  chi-squared-based  likelihood  ratio  test  was  previously  proposed  to  address 
this  problem.  We  have  shown  that  the  chi-squared  test  of  selection  substantially  underestimates 
the  probability  of  Type  I  error,  leading  to  more  false  positives  than  indicated  by  its  P-value, 
especially  at  stringent  P-values.  We  developed  two  methods  to  correct  this  bias.  The  empirical 
likelihood  ratio  test  rejects  neutrality  when  the  likelihood  ratio  statistic  falls  in  the  tail  of  the 
empirical  distribution  obtained  under  the  most  likely  neutral  population  size.  The  frequency 
increment  test  rejects  neutrality  if  the  distribution  of  normalized  allele  frequency  increments 
exhibits  a  mean  that  deviates  significantly  from  zero.  We  characterized  the  statistical  power  of 
these  two  new  tests  for  selection,  and  we  applied  them  to  three  experimental  data  sets.  We 
have  shown  that  both  of  these  new  techniques  have  power  to  detect  selection  in  practical 
parameter  regimes,  such  as  those  encountered  in  fitness  assays  of  microbial  populations.  A 
paper  describing  these  results  is  in  press  at  Genetics. 

Deleterious  mutations  and  adaptation 

In  a  new  theoretical  direction  this  past  year  we  have  also  studied  the  role  of  deleterious 
substitutions  during  adaptation  -  that  is,  the  chance  that  deleterious  mutations  might  fix,  with  no 
productive  side  effect  whatsoever,  while  a  population  is  adapting.  The  literature  on  the  genetics 
of  adaptation  typically  neglects  the  possibility  that  deleterious  mutations  will  fix  in  a  population. 
We  have  shown,  by  contrast,  that  even  when  a  population  is  destined  to  adapt  towards  higher 
fitness  over  the  long  term,  the  first  mutation  to  fix  will  often  decrease  fitness.  In  fact,  in  many 
regimes  of  populations  undergoing  long-term  adaptation,  the  expected  effect  of  the  first 
substitution  is  actually  to  decrease  fitness.  We  demonstrated  these  results  under  two  of  the 
most  widely  used  models  of  fitness  landscapes:  the  house  of  cards  model  of  Kingman  and 
Fisher’s  geometric  model.  Importantly,  we  also  developed  a  simple  intuition  to  help  explain  the 
surprising  prevalence  of  deleterious  substitutions  during  adaptation. 
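
A rough numerical illustration of how deleterious first substitutions can arise is sketched below. It assumes a house-of-cards model in which the log-fitness of every new mutant is drawn from a normal distribution independently of the current genotype, and it uses the standard diffusion approximation for the fixation probability in a haploid population of constant size. The parameter values and function names are illustrative assumptions, and the sketch is not the analysis in our paper.

```python
import math


def fixation_probability(s, n_pop):
    """Diffusion approximation for a new mutant with selection coefficient s."""
    if abs(s) < 1e-12:
        return 1.0 / n_pop  # neutral limit
    x = 2.0 * n_pop * s
    if x < -700.0:
        return 0.0  # strongly deleterious; avoids overflow in expm1
    return math.expm1(-2.0 * s) / math.expm1(-x)


def first_substitution(current, n_pop, sd, grid=4001):
    """Return (probability the first substitution is deleterious, its mean effect).

    current: log-fitness of the resident genotype; sd: standard deviation of the
    mutant log-fitness distribution, which is centered at zero.
    """
    low, high = -6.0 * sd, 6.0 * sd
    step = (high - low) / (grid - 1)
    total = deleterious = effect = 0.0
    for i in range(grid):
        mutant = low + i * step
        s = mutant - current
        # Weight = (relative) mutational density times probability of fixation.
        weight = math.exp(-0.5 * (mutant / sd) ** 2) * fixation_probability(s, n_pop)
        total += weight
        effect += weight * s
        if s < 0.0:
            deleterious += weight
    return deleterious / total, effect / total


p_small, effect_small = first_substitution(current=0.02, n_pop=10, sd=0.02)
p_large, effect_large = first_substitution(current=0.02, n_pop=10000, sd=0.02)
print("N = 10:    P(deleterious) =", p_small, " mean effect =", effect_small)
print("N = 10000: P(deleterious) =", p_large, " mean effect =", effect_large)
```

In this sketch the weight of deleterious substitutions depends strongly on population size: when selection is weak relative to drift, the large supply of deleterious mutations outweighs their lower chance of fixation.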

These  results  have  implications  for  our  understanding  of  adaptation.  First,  our  results 
imply  that  the  common  practice  of  neglecting  deleterious  substitutions  can  lead  to  qualitatively 
incorrect  predictions  for  the  dynamics  of  adaptation.  More  generally,  our  analysis  helps  to 
dispel  the  widespread,  but  mistaken,  impression  that  a  population  below  its  equilibrium  mean 
fitness  will  increase  in  fitness  as  it  approaches  equilibrium.  Finally,  our  results  have  practical 
implications  for  the  expected  pattern  of  substitutions  in  response  to  a  change  in  population  size. 
A  paper  describing  these  results  and  their  implications  is  in  press  at  Evolution. 

Role of epistasis in protein evolution

An  important  question  in  molecular  evolution  is  whether  an  amino  acid  that  occurs  at  a 
given  site  makes  an  independent  contribution  to  fitness,  or  whether  its  contribution  depends  on 
the state of other sites in the organism’s genome, a phenomenon known as epistasis. Work by Kondrashov and
colleagues  recently  argued  that  epistasis  must  be  pervasive  throughout  protein  evolution, 
because  the  observed  ratio  between  the  per-site  rates  of  non-synonymous  and  synonymous 
substitutions  (dN/dS)  is  much  lower  than  would  be  expected  in  the  absence  of  epistasis. 
However,  when  calculating  the  expected  dN/dS  ratio  in  the  absence  of  epistasis,  Kondrashov 
assumed  that  all  amino  acids  observed  at  a  given  position  in  a  protein  alignment  have  equal 
fitness.  We  relaxed  this  unrealistic  assumption  and  found  that  any  dN/dS  value  can  in  principle 
be achieved at a site, without epistasis; furthermore, for all nuclear and chloroplast genes in the
Kondrashov data set, we showed that the observed dN/dS values and the observed patterns of
amino-acid  diversity  at  each  site  are  jointly  consistent  with  a  non-epistatic  model  of  protein 
evolution.  These  results  are  important  because  they  highlight  the  need  for  more  nuanced 
techniques,  such  as  the  time-series  methods  discussed  above,  for  inferring  fitness  landscapes. 
We  published  these  results  in  Nature. 

I  have  also  worked  to  understand  how  epistasis  between  sites  in  a  single  protein  may 
influence  the  course  of  its  evolution.  We  used  computational  models  of  thermodynamic  stability 
in  a  ligand-binding  protein  to  explore  the  structure  of  epistasis  in  simulations  of  protein 
sequence  evolution.  Even  though  the  predicted  effects  on  stability  of  random  mutations  are 
almost  completely  additive,  we  found  that  the  mutations  that  fix  under  purifying  selection  are 
enriched  for  epistasis.  In  particular,  the  mutations  that  fix  are  contingent  on  previous 
substitutions:  Although  nearly  neutral  at  their  time  of  fixation,  these  mutations  would  be 
deleterious  in  the  absence  of  preceding  substitutions.  Conversely,  substitutions  under  purifying 
selection  are  subsequently  entrenched  by  epistasis  with  later  substitutions:  They  become 
increasingly  deleterious  to  revert  over  time.  Our  results  imply  that,  even  under  purifying 
selection,  protein  sequence  evolution  is  often  contingent  on  history  and  so  it  cannot  be  predicted 
by  the  phenotypic  effects  of  mutations  assayed  in  the  ancestral  background.  We  published  a 
study  describing  these  results  in  Proceedings  of  the  National  Academy  of  Sciences  USA. 

Detecting epistasis between viral surface proteins

In  related  work,  I  have  also  generalized  earlier  work  to  detect  epistasis  in  evolving  viral 
proteins.  Previously,  my  group  has  detected  epistasis  between  sites  within  individual  viral 
surface  proteins  undergoing  adaptation.  However,  the  extent  to  which  evolution  of  one  viral 
protein  affects  the  evolution  of  the  other  one  is  unknown.  Therefore,  working  with  colleagues 
from  several  countries,  we  developed  a  novel  phylogenetic  method  for  detecting  the  signatures 
of  genetic  interactions  between  mutations  in  different  genes  -  that  is,  inter-gene  epistasis.  Using 
this  method,  we  showed  that  influenza  surface  proteins  evolve  in  a  coordinated  way,  with 
mutations  in  Hemagglutinin  affecting  subsequent  spread  of  mutations  in  Neuraminidase  and 
vice  versa,  at  many  sites.  Of  particular  interest  was  our  finding  that  the  oseltamivir-resistance 
mutations  in  NA  in  subtype  H1N1  were  likely  facilitated  by  prior  mutations  in  HA.  Our  results 
illustrate  that  the  adaptive  landscape  of  a  viral  protein  is  remarkably  sensitive  to  its  genomic 
context  and,  more  generally,  that  the  evolution  of  any  single  protein  must  be  understood  within 
the  context  of  the  entire  evolving  genome.  We  published  a  paper  describing  these  results  in 
PLoS  Genetics. 

Inferring  epistasis  from  sampled  genotypes 

Measuring  a  microbe’s  fitness  landscape  is  virtually  impossible  in  practice,  because  of 
the coarse resolution of fitness measurements and because of epistasis: the fitness contribution
of  one  locus  may  depend  on  the  states  of  other  loci.  To  account  for  all  possible  forms  of 
epistasis,  a  fitness  landscape  must  assign  a  potentially  different  fitness  to  each  genotype,  the 
number  of  which  increases  exponentially  with  the  number  of  loci. 

To  draw  conclusions  from  a  limited  number  of  sampled  genotypes  whose  fitnesses  can 
be  assayed,  researchers  fit  statistical  models  to  approximate  the  fitness  landscape  based  on 
available  data.  This  situation  is  perhaps  best  illustrated  by  recent  studies  of  the  HIV-1  virus.  HIV 
genotypes  were  sampled  from  infected  patients,  and  assayed  for  reproductive  rate.  Whereas 
the entire fitness landscape of HIV-1 consists of reproductive values for roughly 10^600
genotypes,  only  ~70,000  genotypes  were  sampled.  Researchers  therefore  approximated  the 
fitness  landscape,  based  on  the  measured  data,  by  an  expansion  in  terms  of  main  effects  of  loci 
and epistatic interactions among loci. This presents the field with a pressing question: Do
statistical approximations based on available data faithfully reproduce the relevant aspects of
the  true  fitness  landscape  and  accurately  predict  the  dynamics  of  adaptation? 

We  have  begun  to  address  these  fundamental  questions  about  empirical  fitness 
measurements  and  how  they  inform  our  understanding  of  the  underlying  fitness  landscape.  We 
have  quantified  the  effects  of  approximating  a  fitness  landscape  from  data  in  terms  of  main  and 
epistatic  effects  of  loci.  We  demonstrated  that  such  approximations  are  subject  to  two  distinct 
sources  of  biases  that  each  tend  to  under-estimate  high  fitnesses  and  over-estimate  low 
fitnesses.  Biases  in  the  inferred  landscape  distort  commonly  used  measures  of  epistasis  in  the 
landscape.  As  a  result,  the  inferred  landscape  will  provide  systematically  biased  predictions  for 
the  dynamics  of  adaptation.  We  have  identified  the  same  biases  in  a  computational  RNA- 
folding  landscape,  as  well  as  in  transcription  factor  binding  data,  treated  to  the  same  fitting 
procedure.  Finally,  we  have  developed  a  method  to  ameliorate  these  biases  in  certain 
circumstances.  A  manuscript  describing  these  results  is  under  consideration  at  Proceedings  of 
the  National  Academy  of  Sciences. 

Mathematical  aspects  of  Kimura  diffusions 

The infinite-population limits of standard population-genetic models are diffusion
processes that take place on a simplex, which is a higher-dimensional generalization of a triangle. A
point of the simplex specifies the frequencies of the different genotypes. The coefficients of the
PDEs describing these limiting models contain information about the relative effects of genetic
drift,  fitness,  mutation  rates  and  migration.  Because  the  paths  of  the  stochastic  process,  which 
describe  the  time  evolution  of  the  frequencies  of  genotypes,  are  constrained  to  remain  in  a 
simplex,  the  partial  differential  operators  that  generate  these  diffusion  processes  degenerate  at 
the  boundary  of  the  simplex.  This  makes  the  mathematical  analysis  and  numerical  simulation  of 
these  processes  quite  difficult.  We  have  completed  the  foundational  work  on  the  mathematical 
analysis  of  these  diffusion  equations,  and  established  the  needed  connections  with  stochastic 
differential  equations  and  Markov  processes.  This  produces  explicit  estimates  for  the  transition 
kernel of the Markov process, and for related quantities such as the stationary distribution and exit times. This
work  is  described  in  a  monograph  published  by  Princeton  University  Press,  and  a  series  of 
publications  and  preprints. 
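
For concreteness, the operators in question can be written down as follows. This is a sketch in one common normalization, assuming time is measured in units of N generations; the symbols sigma, theta_1, theta_2, and the drift vector b are notation introduced here for illustration and are not taken from the monograph.

```latex
% One locus, two alleles; x is the frequency of the focal allele.
% Assumed scaling: sigma = N s (selection), theta_1 = N u and theta_2 = N v
% (mutation toward and away from the focal allele).
L f(x) = \tfrac{1}{2}\, x(1-x)\, \frac{\partial^2 f}{\partial x^2}
       + \bigl[\, \sigma\, x(1-x) + \theta_1 (1-x) - \theta_2\, x \,\bigr]
         \frac{\partial f}{\partial x},
       \qquad x \in [0,1].

% Several genotypes; x = (x_1, \dots, x_n) is a point of the simplex, and
% b(x) collects the selection, mutation, and migration terms.
L f(x) = \tfrac{1}{2} \sum_{i,j} x_i (\delta_{ij} - x_j)\,
         \frac{\partial^2 f}{\partial x_i\, \partial x_j}
       + \sum_{i} b_i(x)\, \frac{\partial f}{\partial x_i}.
```

The second-order coefficient x(1-x) vanishes at x = 0 and x = 1, and its multi-dimensional analogue vanishes in the direction normal to each face of the simplex. This is the degeneracy at the boundary referred to above.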

COLLABORATIONS  AND  LEVERAGED  FUNDING 

Several  publications  from  this  ARO  project  directly  impact  the  interpretation  of  evolution 
experiments  performed  by  other  scientists,  including  Chris  Marx  (Harvard),  Rich  Lenski 
(Michigan  State),  and  Tim  Cooper  (U.  Houston).  As  a  result  of  these  papers,  we  have  begun  to 
collaborate  and  draft  a  new  research  proposal  that  involves  tight  co-ordination  of  theory  and 
experiment, with Marx and Lenski, to understand the outcomes of bacterial evolution.

TECHNOLOGY  TRANSFER 

None,  yet.  We  are  producing  statistical  techniques  and  algorithms  for  interpreting 
empirical  data,  which  will  likely  be  in  the  public  domain. 